March 31, 2019

Task overview

BUSINESS QUESTION: Which are the top 5 products that will be the most profitable for the company?

What data do we have?

New product attributes and existing product attributes.

  • Predicting sales of four different product types: PCs, Laptops, Netbooks and Smartphones
  • Assessing the impact that service reviews and customer reviews have on sales of the different product types

Index

  1. Data cleaning

  2. Data exploration

  3. Pre-process: feature selection (correlation matrix) & feature engineering

  4. Modeling: linear regression, KNN, SVM, Random Forest, GBM

  5. Error analysis

Data cleaning

Transformation to factor:

fact_var <- c("ProductType", "ProductNum")
# lapply keeps the data frame structure (apply would coerce it to a character matrix)
ex_prod[fact_var] <- lapply(ex_prod[fact_var], as.factor)

Setting the row names:

ex_prod <- tibble::column_to_rownames(.data = ex_prod,
                                      var = "ProductNum")
# column_to_rownames() already removes the ProductNum column, so no further cleanup is needed

Data cleaning: missing values with VIM

## 
##  Variables sorted by proportion of missings: 
##               Variable  Proportion
##        BestSellersRank      0.1875
##            ProductType 0.0000
##                  Price 0.0000
##          x5StarReviews 0.0000
##          x4StarReviews 0.0000
##          x3StarReviews 0.0000
##          x2StarReviews 0.0000
##          x1StarReviews 0.0000
##  PositiveServiceReview 0.0000
##  NegativeServiceReview 0.0000
##       Recommendproduct 0.0000
##         ShippingWeight 0.0000
##           ProductDepth 0.0000
##           ProductWidth 0.0000
##          ProductHeight 0.0000
##           ProfitMargin 0.0000
##                 Volume 0.0000
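The VIM summary above can be reproduced without the package. A minimal base-R sketch on a toy data frame (the real `ex_prod` is not reloaded here) that computes the same per-variable proportion of missings and drops the offending column:

```r
# Toy data frame standing in for ex_prod (only two columns, for illustration)
df <- data.frame(BestSellersRank = c(1, NA, 3, NA, 5, 6, 7, 8),
                 Price           = c(10, 20, 30, 40, 50, 60, 70, 80))

# Proportion of NAs per variable, sorted as in the VIM summary
miss_prop <- sort(colMeans(is.na(df)), decreasing = TRUE)
miss_prop
#> BestSellersRank           Price
#>            0.25            0.00

# BestSellersRank is the only variable with missings, so we drop it
df$BestSellersRank <- NULL
```

With VIM loaded, `VIM::aggr(ex_prod)` produces the summary and the missingness plot shown above.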

1st data expl.: Blackwell business

1st data expl.: Volume distribution

1st modeling: linear regression

# train/test split (80/20)
set.seed(123)  # make the partition reproducible
train_id <- createDataPartition(y = ex_prod$Volume, p = 0.80, list = F)
train <- ex_prod[train_id,]
test <- ex_prod[-train_id,]

# create linear regression model
mod_lm <- lm(formula = Volume ~ ., data = train)

# model performance
postResample(pred = predict(object = mod_lm, newdata = test),
             obs = test$Volume)
##         RMSE     Rsquared          MAE 
## 6.413444e-14 1.000000e+00 3.896143e-14

Main predictors:

  1. 5 stars
  2. Product type: Game console

2nd pre-process: feature selection
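The slides do not show the code behind this step; a minimal sketch of correlation-based feature selection on toy data (the column names are illustrative, matching the attribute names used elsewhere):

```r
set.seed(123)
# Toy numeric columns: x4StarReviews is almost a copy of x5StarReviews
x5    <- rnorm(30)
x4    <- x5 + rnorm(30, sd = 0.1)
price <- rnorm(30)
dat   <- data.frame(x5StarReviews = x5, x4StarReviews = x4, Price = price)

# Correlation matrix of the candidate predictors
cor_mat <- cor(dat)

# Flag pairs above a |r| = 0.9 cutoff as candidates for removal
high_pairs <- which(abs(cor_mat) > 0.9 & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs  # row/col indices of the highly correlated pair
```

With caret loaded, `findCorrelation(cor_mat, cutoff = 0.9)` automates the choice of which column of each correlated pair to drop.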

2nd modeling: linear regression

##         RMSE     Rsquared          MAE 
## 2.686487e-13 1.000000e+00 1.532404e-13

Main predictors:

  1. 5 stars
  2. Product type: PC
  3. Price

The model is again essentially perfect (R² = 1 with near-zero RMSE), which points to target leakage through the star reviews rather than real predictive power.

3rd pre-process: outlier detection in stars
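The detection step itself is not shown in the slides; a minimal sketch using the classic 1.5 × IQR rule (the same fence `boxplot()` draws) on a toy vector of star counts:

```r
# Toy star counts with one extreme product
stars <- c(12, 15, 9, 20, 14, 11, 18, 2000)

q   <- quantile(stars, c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]

# Values outside the 1.5*IQR fences are flagged as outliers
fence_lo <- q[[1]] - 1.5 * iqr
fence_hi <- q[[2]] + 1.5 * iqr
outliers <- stars[stars < fence_lo | stars > fence_hi]
outliers
#> [1] 2000
```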

3rd pre-process: feature engineering
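Judging from the predictor names that appear later (`total_stars`, the product-type dummies), this step presumably collapses the five collinear star columns into a single total; a minimal sketch on toy data:

```r
# Toy review counts standing in for the ex_prod star columns
dat <- data.frame(x5StarReviews = c(10, 3),
                  x4StarReviews = c(4, 2),
                  x3StarReviews = c(1, 0),
                  x2StarReviews = c(0, 1),
                  x1StarReviews = c(2, 5))

# One total instead of five highly correlated counts
star_cols <- paste0("x", 5:1, "StarReviews")
dat$total_stars <- rowSums(dat[, star_cols])
dat$total_stars
#> [1] 17 11
```

The product-type dummies (PC, Laptop, Netbook, Smart Phone) can be built similarly with `model.matrix(~ ProductType + 0, data = ex_prod)`.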

3rd modeling: linear regression

##        RMSE    Rsquared         MAE 
## 120.0594381   0.9787394  83.0056191
Variables used:

  • Total number of stars
  • Positive service
  • Negative service
  • Recommended product
  • PC
  • Laptop
  • Netbook
  • Smart Phone

Main predictors:

  • Total number of stars
  • Positive service
  • Negative service
  • Recommended product

3rd error check: errors in all the existing products

4th exploration: recommendation variable

4th pre-process: repeated observations

product_num  ProductType       total_stars  Pos_Ser  Neg_Ser  Recomend   Vol
        132  ExtendedWarranty            4        0        3       0.4     0
        133  ExtendedWarranty            8        0        1       0.6    20
        134  ExtendedWarranty          361      280        8       0.9  1232
        135  ExtendedWarranty          361      280        8       0.9  1232
        136  ExtendedWarranty          361      280        8       0.9  1232
        137  ExtendedWarranty          361      280        8       0.9  1232
        138  ExtendedWarranty          361      280        8       0.9  1232
        139  ExtendedWarranty          361      280        8       0.9  1232
        140  ExtendedWarranty          361      280        8       0.9  1232
        141  ExtendedWarranty          361      280        8       0.9  1232
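Products 134–141 share identical attributes and volume, so any model would count them eight times. A minimal sketch of the de-duplication on a toy slice of the table (base R; with dplyr loaded, `dplyr::distinct()` does the same):

```r
# Toy slice mirroring the repeated ExtendedWarranty rows
dat <- data.frame(ProductType = rep("ExtendedWarranty", 5),
                  total_stars = c(4, 8, 361, 361, 361),
                  Vol         = c(0, 20, 1232, 1232, 1232))

# Keep one row per unique attribute combination
dedup <- dat[!duplicated(dat), ]
nrow(dedup)
#> [1] 3
```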

4th feature engineering: pos. and neg. service

4th modeling: linear regression

ex_prod_4mod <- ex_prod_dummy %>% 
  dplyr::select(Pos_Ser, Neg_Ser, Recomend, PC, Laptop, 
                Netb, Smart_Ph, total_stars, Vol)

# create training and test sets with the selected features
set.seed(123)
train_id <- createDataPartition(y = ex_prod_4mod$Vol,
                                p = 0.80,
                                list = F)
train <- ex_prod_4mod[train_id,]
test <- ex_prod_4mod[-train_id,]

# model creation
mod_4lm <- lm(formula = Vol ~., data = train)

# metrics
postResample(pred = predict(object = mod_4lm, newdata = test),
             obs = test$Vol)
##        RMSE    Rsquared         MAE 
## 151.3046880   0.8931816 100.0846194

The model's performance has decreased. Let's see how it performs on the categories we are interested in.

4th error check: error visualization
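The visualization itself is not reproduced in these notes; a minimal sketch of a predicted-vs-observed plot on toy numbers (the values are illustrative, not the real `mod_4lm` output):

```r
# Illustrative observed/predicted volumes (not the real model output)
obs  <- c(100, 250, 400, 1200)
pred <- c(120, 230, 380, 1500)

# Points far from the dashed diagonal are the poorly predicted products
plot(pred, obs, xlab = "Predicted volume", ylab = "Observed volume")
abline(a = 0, b = 1, lty = 2)

residuals <- obs - pred
residuals
#> [1]  -20   20   20 -300
```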

5th modeling: k-nearest neighbours

# k-NN regression with FNN
mod_5knn <- knn.reg(train[,sc_var],      # training predictors
                    test[,sc_var],       # test predictors to evaluate on
                    train$Vol,           # training outcome
                    k = 1,               # number of neighbours
                    algorithm = "brute") # nearest-neighbour search algorithm

# checking the metrics 
postResample(pred = mod_5knn$pred, obs = test$Vol)
##        RMSE    Rsquared         MAE 
## 218.9094181   0.7863258 123.6666667

5th modeling: KNN, choosing the best k

##       RMSE   Rsquared        MAE 
## 80.6583790  0.9658966 53.5454545
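The search over k is not shown in the slides; a minimal base-R sketch (a toy k-NN regressor standing in for `FNN::knn.reg`, used on the previous slide) that scores each k on held-out rows and keeps the best:

```r
# Tiny k-NN regressor, only to illustrate the k search
knn_predict <- function(train_x, train_y, test_x, k) {
  apply(test_x, 1, function(p) {
    d <- sqrt(colSums((t(train_x) - p)^2))  # Euclidean distance to each training row
    mean(train_y[order(d)[1:k]])            # average outcome of the k nearest
  })
}

set.seed(123)
x <- matrix(runif(60), ncol = 2)
y <- 3 * x[, 1] + rnorm(30, sd = 0.05)
tr <- 1:20
te <- 21:30

# RMSE on the held-out rows for k = 1..10; keep the best
rmse_by_k <- sapply(1:10, function(k) {
  pred <- knn_predict(x[tr, ], y[tr], x[te, , drop = FALSE], k)
  sqrt(mean((pred - y[te])^2))
})
best_k <- which.min(rmse_by_k)
```

With caret loaded, `train(..., method = "knn", tuneLength = 10)` runs the same search with cross-validation instead of a single split.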